Linear Thompson Sampling Revisited
Authors
Abstract
We derive an alternative proof for the regret of Thompson sampling (TS) in the stochastic linear bandit setting. While we obtain a regret bound of order $\tilde{O}(d^{3/2}\sqrt{T})$, as in previous results, the proof sheds new light on how TS works. We leverage the structure of the problem to show how the regret is related to the sensitivity (i.e., the gradient) of the objective function, and how selecting optimal arms associated with optimistic parameters controls it. We thus show that TS can be seen as a generic randomized algorithm whose sampling distribution is designed to have a fixed probability of being optimistic, at the cost of an additional $\sqrt{d}$ regret factor compared to a UCB-like approach. Furthermore, we show that our proof can be readily applied to regularized linear optimization and generalized linear model problems.
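As a rough illustration of the randomized algorithm the abstract describes, the following Python sketch runs a linear TS loop on a finite arm set: it fits a regularized least-squares estimate, perturbs it with a random sample, and plays the arm that is optimal for the perturbed parameter. Everything specific here is an assumption made for illustration and is not taken from the abstract: the function name `linear_thompson_sampling`, the Gaussian perturbation, the constant scaling `beta` (the analysis would use a time-dependent confidence radius instead), the Gaussian reward noise, and the finite arm set.

```python
import numpy as np


def linear_thompson_sampling(arms, theta_star, horizon, lam=1.0, beta=1.0,
                             noise_std=0.1, seed=0):
    """Run a simplified linear TS loop and return the cumulative pseudo-regret.

    arms:       (n_arms, d) array of arm feature vectors (hypothetical finite set).
    theta_star: (d,) true parameter, used only to simulate rewards and regret.
    beta:       constant perturbation scale (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    d = arms.shape[1]
    means = arms @ theta_star           # expected reward of each arm
    best_mean = means.max()

    V = lam * np.eye(d)                 # regularized design matrix
    b = np.zeros(d)                     # running sum of reward * feature
    regret = np.zeros(horizon)
    total = 0.0

    for t in range(horizon):
        # Regularized least-squares estimate of the parameter.
        theta_hat = np.linalg.solve(V, b)

        # V^{-1/2} from the eigendecomposition of the symmetric PD matrix V.
        eigvals, eigvecs = np.linalg.eigh(V)
        V_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

        # Randomized parameter: estimate plus a scaled Gaussian perturbation.
        eta = rng.standard_normal(d)
        theta_tilde = theta_hat + beta * (V_inv_sqrt @ eta)

        # Play the arm that is optimal for the sampled parameter.
        idx = int(np.argmax(arms @ theta_tilde))
        x = arms[idx]
        reward = means[idx] + noise_std * rng.standard_normal()

        # Update the sufficient statistics and the pseudo-regret.
        V += np.outer(x, x)
        b += reward * x
        total += best_mean - means[idx]
        regret[t] = total

    return regret
```

A call such as `linear_thompson_sampling(np.eye(3), np.array([1.0, 0.5, 0.0]), horizon=200)` returns the cumulative pseudo-regret after each round. Increasing `beta` makes the sampled parameter more likely to be optimistic, at the price of more exploration.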
Similar resources
Linear Thompson Sampling Revisited A Examples of TS distributions
A Examples of TS distributions. Example 1: Uniform distribution $\eta \sim \mathcal{U}_{B_d(0,\sqrt{d})}$. The uniform distribution satisfies the concentration property with constants $c = 1$ and $c' = e/d$ by definition. Since the set $\{\eta \mid u^\top \eta \ge 1\} \cap B_d(0, \sqrt{d})$ is a hyperspherical cap for any direction $u$ of $\mathbb{R}^d$, the anti-concentration property is satisfied provided that the ratio between the volume of a hyperspherical cap of ...
Linear preservers of Miranda-Thompson majorization on $M_{m,n}$
Miranda-Thompson majorization is a group-induced cone ordering on $\mathbb{R}^{n}$ induced by the group of generalized permutations with determinant equal to 1. In this paper, we generalize Miranda-Thompson majorization to matrices. For $X$, $Y \in M_{m,n}$, $X$ is said to be Miranda-Thompson majorized by $Y$ (denoted by $X \prec_{mt} Y$) if there exists some $D \in \mathrm{Conv}(G)$ s...
Thompson Sampling for Multi-Objective Multi-Armed Bandits Problem
The multi-objective multi-armed bandit (MOMAB) problem is a sequential decision process with stochastic rewards. Each arm generates a vector of rewards instead of a single scalar reward. Moreover, these multiple rewards might be conflicting. The MOMAB problem has a set of Pareto optimal arms, and an agent’s goal is not only to find that set but also to play the arms in that set evenly or fairly....
Thompson Sampling for Online Learning with Linear Experts
In this note, we present a version of the Thompson sampling algorithm for the problem of online linear generalization with full information (i.e., the experts setting), studied by Kalai and Vempala (2005). The algorithm uses a Gaussian prior and time-varying Gaussian likelihoods, and we show that it essentially reduces to Kalai and Vempala’s Follow-the-Perturbed-Leader strategy, with exponentiall...
SOLVING FUZZY LINEAR PROGRAMMING PROBLEMS WITH LINEAR MEMBERSHIP FUNCTIONS-REVISITED
Recently, Gasimov and Yenilmez proposed an approach for solving two kinds of fuzzy linear programming (FLP) problems. Through the approach, each FLP problem is first defuzzified into an equivalent crisp problem which is non-linear and even non-convex. Then, the crisp problem is solved by the use of the modified subgradient method. In this paper we will have another look at the earlier defuzzifi...